Project description¶

Analyze the assortment of goods based on the transactions made by an online store of household goods.

  • Conduct exploratory data analysis;
  • Analyse the trade assortment;
  • Make recommendations for the store management.

1. Studying the dataset¶

Import of libraries¶

In [1]:
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import plotly.express as px
import scipy.stats as stats

Import of the dataset¶

In [2]:
df = pd.read_csv('', sep=',')

First glance at the dataset¶

In [3]:
df.head()
Out[3]:
date customer_id order_id product quantity price
0 2018100100 ee47d746-6d2f-4d3c-9622-c31412542920 68477 Комнатное растение в горшке Алое Вера, d12, h30 1 142.0
1 2018100100 ee47d746-6d2f-4d3c-9622-c31412542920 68477 Комнатное растение в горшке Кофе Арабика, d12,... 1 194.0
2 2018100100 ee47d746-6d2f-4d3c-9622-c31412542920 68477 Радермахера d-12 см h-20 см 1 112.0
3 2018100100 ee47d746-6d2f-4d3c-9622-c31412542920 68477 Хризолидокарпус Лутесценс d-9 см 1 179.0
4 2018100100 ee47d746-6d2f-4d3c-9622-c31412542920 68477 Циперус Зумула d-12 см h-25 см 1 112.0

Conclusions:

  1. Column names are intuitive.
  2. Date format should be corrected: date and hour digits are merged into one body, and need to be converted into datetime format.
  3. Customer IDs are too long: add a column with shortened IDs to include in visualisations
  4. Add a column with shorter product names to make division of products into categories easier.
In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6737 entries, 0 to 6736
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   date         6737 non-null   int64  
 1   customer_id  6737 non-null   object 
 2   order_id     6737 non-null   int64  
 3   product      6737 non-null   object 
 4   quantity     6737 non-null   int64  
 5   price        6737 non-null   float64
dtypes: float64(1), int64(3), object(2)
memory usage: 315.9+ KB

Conclusions:

  1. Date format is incorrect
  2. Customer ID's and product names formats are to be turned into 'string'

Change of data formats¶

In [5]:
# int into string
df['customer_id'] = df['customer_id'].astype('string')
df['product'] = df['product'].astype('string')

# int into datetime
df['date'] = df['date'].astype('string')
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d%H')
In [6]:
# check
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6737 entries, 0 to 6736
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   date         6737 non-null   datetime64[ns]
 1   customer_id  6737 non-null   string        
 2   order_id     6737 non-null   int64         
 3   product      6737 non-null   string        
 4   quantity     6737 non-null   int64         
 5   price        6737 non-null   float64       
dtypes: datetime64[ns](1), float64(1), int64(2), string(2)
memory usage: 315.9 KB

Time interval covered by the dataset¶

In [7]:
# building a histogrsam
df['date'].hist(bins=150);

# naming title and axes
plt.xlabel('time')
plt.ylabel('quantity')
plt.title('Events by date')
plt.show()
In [8]:
display(df['date'].min())
display(df['date'].max())

print("Records cover", df['date'].max() - df['date'].min())
Timestamp('2018-10-01 00:00:00')
Timestamp('2019-10-31 16:00:00')
Records cover 395 days 16:00:00

Conclusions:

  1. The dataset covers the period from October 2018 to October 2019, or 395 days.
  2. Most of the events occurred in May-June 2019, and such intensity may indicate the presence of duplicates, anomalies or technical errors.
  3. The fewest events are observed in January, as well as in the last months of 2019. Perhaps the store is gradually losing customers: if this assumption is correct, then we will also see a decrease in revenue and/or purchase volumes in these months.
In [9]:
df['quantity'].hist(bins=100);

plt.xlabel('quantity of goods')
plt.ylabel('records')
plt.title('Quantity of goods histogram')
plt.show()
In [10]:
# orders IDs histogram
display(df['order_id'].astype('int').hist())
<Axes: >

Conclusions:

Customers tend to buy 1-2 goods, i.e. this is a retail store, and wholesale purchases, if any, need to be excluded from further analysis.

Order numbers are sequential, so it is worth asking the store’s technical support specialists why the numbering is interrupted.

In [11]:
df['price'].hist(bins=100);

plt.xlabel('price')
plt.ylabel('records')
plt.title('Price of goods histogram')
plt.show()

Conclusions:

The price of the purchased goods generally does not exceed 1,000 RUB.

2. Data processing: dates, duplicates and anomalies¶

New date formats¶

In [12]:
# Y+M+W
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['week'] = df['date'].dt.isocalendar().week

# WD+H
df['dow'] = df['date'].dt.dayofweek + 1
df['hour'] = df['date'].dt.hour

# Y+M
df['ymonth'] = df['year'].astype('string') + df['month'].astype('string')
df['ymonth'] = pd.to_datetime(df['ymonth'], format='%Y%m')
In [13]:
# revenue calculation
df['revenue'] = df['price'] * df['quantity']
In [14]:
# six-digits customer IDs

# unique customers
display(df['customer_id'].nunique())
2451
In [15]:
# column with six-digits customer IDs
df['cid'] = [x.strip()[-7:] for x in df['customer_id']]

# check
display(df['cid'].nunique())
2451
In [16]:
# short product names column
df['prod'] = [x.split()[0] + ' ' + x.split()[1] for x in df['product']]

Blanks and duplicates processing¶

In [17]:
df.isna().sum() 
Out[17]:
date           0
customer_id    0
order_id       0
product        0
quantity       0
price          0
year           0
month          0
week           0
dow            0
hour           0
ymonth         0
revenue        0
cid            0
prod           0
dtype: int64
In [18]:
df.duplicated().sum()
Out[18]:
0
In [19]:
# let's see how many duplicates there are in our dataset
# the presence of such duplicates may indicate failures in the order system, or automatic orders

display(df.drop_duplicates(subset=['customer_id', 'order_id', 'product', 'quantity', 'price'])['date'].count() / df['date'].count())
0.7233189847112959

Such orders, which make up almost 28% of the dataset, should be excluded from the analysis

In [20]:
# deleting
df = df.drop_duplicates(subset=['customer_id', 'order_id', 'product', 'quantity', 'price'])
In [21]:
# next, we will check the adequacy of prices for goods worth more than 2000 RUB.
# were there any errors when entering these prices into system?

display(df.query('price >= 2000')
        .pivot_table(index='prod', values='price', aggfunc='max'))
price
prod
Tepмокружка AVEX 2399.0
Автоматическая щетка 7229.0
Афеляндра скуарроса 3524.0
Бак для 3749.0
Ведро для 3749.0
Весы напольные 2849.0
Гладильная доска 7424.0
Гладильная доска-стремянка 2399.0
Гортензия Микс 3599.0
Двуспальное постельное 2024.0
Доска гладильная 3299.0
Ерш для 3524.0
Карниз алюминиевый 2099.0
Коврик для 5474.0
Коврик придверный 2009.0
Комплект для 5399.0
Котел алюминиевый 2924.0
Мантоварка-пароварка WEBBER 2219.0
Мусорный контейнер 5512.0
Набор Vileda 2924.0
Набор инструментов 5399.0
Наматрасник Wellness 3074.0
Новогоднее дерево 3524.0
Одеяло Wellness 4724.0
Покрывало жаккард 6134.0
Полки QWERTY 4312.0
Пылесос DELTA 2249.0
Сиденье для 6149.0
Скатерть 350х150 2249.0
Скатерть Арлет 2174.0
Скатерть Джулия 2249.0
Стремянка 5 3974.0
Стремянка 7 7724.0
Стремянка COLOMBO 3449.0
Стремянка Colombo 2699.0
Стремянка FRAMAR 4499.0
Стремянка Scab 5549.0
Стремянка алюминиевая 4949.0
Стремянка-табурет алюминиевая 2699.0
Стремянки Colombo 3974.0
Сумка-тележка 2-х 2849.0
Сумка-тележка 3-х 2699.0
Сумка-тележка TWIN 2624.0
Сумка-тележка хозяйственная 8737.0
Сушилка Meliconi 5594.0
Сушилка для 7004.0
Сушилка уличная 14917.0
Урна уличная 7349.0
Урна-пепельница из 5287.0
Фал капроновый 2099.0
Цитрофортунелла Кумкват 3074.0
Швабра для 2624.0
Швабра хозяйственная 3224.0
Штора для 4424.0

We checked the prices of the above goods on the Internet and found that they do not differ significantly from those presented in the dataset

In [22]:
# let’s check if there are any orders attributed to several customers at once

bugged_orders = (
        df.pivot_table(index='order_id', values='customer_id', aggfunc='nunique')
          .reset_index()
          .sort_values('customer_id', ascending=False)
          .query('customer_id > 1')['order_id'] 
          .to_list()
                )
display(bugged_orders)
[72845,
 71480,
 69485,
 69310,
 69833,
 72790,
 72778,
 14872,
 71542,
 71054,
 71663,
 70726,
 69531,
 70542,
 70903,
 69283,
 71226,
 71571,
 69410,
 69345,
 70808,
 70114,
 70631,
 71461,
 72950,
 71648,
 70946,
 68785,
 72188]
In [23]:
# % of such orders

display(df.query('order_id in @bugged_orders')['order_id'].count() / df['order_id'].count())
0.013544018058690745

1,3%. To be deleted

In [24]:
# deleting

df = df.query('order_id not in @bugged_orders')

Conclusions:

We preprocessed the data, clearing the dataset from possible errors and duplicates We reduced the data to convenient types and added columns that will help us in further analysis

3. Analysis¶

Anomalies: seek and delete¶

First, let’s take a look at the distribution of the quantity of purchased goods in order to exclude wholesale purchases from the analysis.

In [25]:
sns.boxplot(x=df['quantity'], orient='h')

plt.xlabel('quantity of goods, in pcs')
plt.title('Quantity of goods boxplot')
plt.show()


# detailed boxplot
sns.boxplot(x=df['quantity'], orient='h')

plt.axis([0, 200, None, None])
plt.xlabel('quantity of goods, in pcs')
plt.title('Detailed quantity of goods boxplot')
plt.show()
In [26]:
# wholesale goods
display(df.query('quantity > 50')
              .pivot_table(index='prod', values='quantity', aggfunc='sum')
              .reset_index()
              .sort_values('quantity', ascending=False)
              .head(15)
              .plot(x='prod', y='quantity', kind='bar'));

plt.xlabel('Product')
plt.ylabel('quantity in pcs')
plt.title('Store sells the following wholesale goods')
plt.show()
<Axes: xlabel='prod'>

Conclusions:

There are many wholesale purchases in the dataset. Taking into account that the store’s assortment includes many small and/or paired items (forks, seeds, hooks etc.), we consider purchases of 50 pieces of the same product as wholesale

In [27]:
# % of wholesale
display(len(df.query('quantity > 50')) / len(df))

# wholesale revenue
display(df.query('quantity > 50')['revenue'].sum())
0.004784688995215311
880158.0
In [28]:
# checking that most revenue comes from the sale of single goods
# (there can be several records under one order ID)

revenue_top_quantity = (df.pivot_table(index='quantity', values='revenue', aggfunc='sum')
                      .reset_index()
                      .sort_values('revenue', ascending=False) 
                     ) 

revenue_top_quantity.head()
Out[28]:
quantity revenue
0 1 2258065.0
48 1000 675000.0
1 2 228022.0
9 10 107450.0
2 3 86046.0
In [29]:
# let's take a look at wholesale revenue trend before deleting the records
revenue_by_month = (df.query('quantity > 50')
                      .pivot_table(index='ymonth', values='revenue', aggfunc='sum')
                      .reset_index()
                      .sort_values('ymonth', ascending=True) 
                     ) 

revenue_by_month.plot(x='ymonth', y='revenue', kind='bar') 


plt.xlabel('month')
plt.ylabel('revenue in RUB')
plt.title('Wholesale revenue by month')
plt.show()

Therefore, wholesale purchases account for 0.5% of the data volume. From the table and monthly revenue chart we see that 700 kRUB of wholesale revenue comes from a single abnormal transaction recorded in June 2019.

Having excluded this sale, as well as transactions of more than 50 product pieces, we obtained adequate data on retail purchases

In [30]:
# exclusion of wholesale and abnormal records
df = df.query('quantity <= 50')
In [31]:
# prices boxplot
sns.boxplot(x=df['price'], orient='h')

plt.xlabel('price in RUB')
plt.title('Prices boxplot')
plt.show()


# detailed prices boxplot
sns.boxplot(x=df['price'], orient='h')

plt.axis([0, 2000, None, None])
plt.xlabel('Цена, у.е.')
plt.title('Detailed prices boxplot')
plt.show()

Conclusions:

Most goods are sold at prices ranging from 50 to 500 RUB a piece. Abnormal price starts at 1,000 RUB.However, we have previously analyzed prices above 2,000 RUB and concluded that there were no input errors or suspiciously high prices.

Therefore, we do not exclude additional data from the analysis.

Additional info¶

In [32]:
# number of unique customers
display(df['customer_id'].nunique())
display(df['order_id'].nunique())

# average number of orders made by one unique customer
display(df['order_id'].nunique() / df['customer_id'].nunique())
2375
2734
1.1511578947368422
In [33]:
# total revenue
display(df['revenue'].sum())

# revenue by one unique customer
display(df['revenue'].sum() / df['customer_id'].nunique())
3214457.0
1353.4555789473684
In [34]:
# products quantity
display(df['product'].nunique())

# products in average order
display(df['product'].count() / df['order_id'].nunique())
2318
1.7498171177761521

Conclusions:

  1. During the period under review, the store sold 2,318 unique products. Managers can compare this quantity with the total number of product items and estimate the volume of goods that were not sold at all.
  2. 2375 customers placed 2734 orders with the store. Therefore, the majority of clients made only one order during the period.
  3. Retail revenue for the period amounted to 3.2 mCU, with one customer accounted for an average revenue of 1,353 CU.

Products categorisation¶

In [35]:
# let's add a file with the "product" - "category" pairs that we have entered as a separate dataset
df_cat = pd.read_csv(r'C:\Users\kvkoz\Downloads\cat1.csv', sep=',')
In [36]:
df_cat.head()
Out[36]:
product cat
0 Комнатное растение в горшке Алое Вера, d12, h30 растения
1 Комнатное растение в горшке Кофе Арабика, d12,... растения
2 Радермахера d-12 см h-20 см растения
3 Хризолидокарпус Лутесценс d-9 см растения
4 Циперус Зумула d-12 см h-25 см растения
In [37]:
# category column
df = pd.merge(df, df_cat, on='product')

Let us perform a detailed data analysis so as to define nuances in sales trends, and come up with recommendations for the store management

In [38]:
# revenue and selling volumes diagram
dbm = df.groupby(['ymonth'])[['revenue','quantity']].sum()


fig = plt.figure()

ax = fig.add_subplot(111)
ax2 = ax.twinx()

width = 0.3


dbm.revenue.plot(kind='bar', color='magenta', ax=ax, width=width, position=1, label='revenue')
dbm.quantity.plot(kind='bar', color='cyan', ax=ax2, width=width, position=0, label='quantity')


plt.legend(loc="upper right")
ax.set_ylabel('revenue in CU')
ax2.set_ylabel('volume in pcs')
plt.title('Sales trends')

plt.show()

We are analysing the remaining sales of 2.8 mCU, cleared of wholesale and abnormal transactions.

The distribution of sales shows that peak sales were observed in autumn of 2018 and were gradually decreasing after: by the end of the analysed period, sales fell by 1.5 times, and in October 2019 sales were almost half as much as in October 2018.

At the same time, sales declines were observed in January and June. The January decline is possible due to long holidays, and the June decline may have occured due to the vacation season.

The distribution of volumes of goods sold basically repeats the distribution of revenue, with the exception of two points:

  1. The peak here is observed in April-May. There are probably a lot of inexpensive items being purchased during these months.
  2. There were no sales declines in June

Let's see how revenue is distributed among customers and orders.

In [39]:
# revenue and sales volume by top-10 customers
dbc = df.groupby(['cid'])[['revenue','quantity']].sum().sort_values('quantity', ascending=False).head(10)

fig = plt.figure() 

ax = fig.add_subplot(111)
ax2 = ax.twinx()

width = 0.3


dbc.revenue.plot(kind='bar', color='magenta', ax=ax, width=width, position=1, label='revenue')
dbc.quantity.plot(kind='bar', color='cyan', ax=ax2, width=width, position=0, label='quantity')

# добавляем названия
plt.legend(loc="upper right")
ax.set_ylabel('revenue in CU')
ax2.set_ylabel('quantity in pcs')
plt.title('Revenue and sales volume by top-10 customers')

plt.show()

We see that the two largest buyers account for 7% of all sales, but next ones account for less than 2%. They are also the largest buyers in terms of quantity of goods.

Let's see how often they made purchases

In [40]:
# revenue by month
display(df.query('cid in ["ee86d3b","e40d7db"]')
              .pivot_table(index='ymonth', values='quantity', aggfunc='sum')
              .reset_index()
              .sort_values('ymonth', ascending=True)
              .plot(x='ymonth', y='quantity', kind='bar')); # строим диаграмму

plt.xlabel('month')
plt.ylabel('quantity')
plt.title('Customer ee86d3b and e40d7db purchases')
plt.show()
<Axes: xlabel='ymonth'>

They made purchases up until March 2019, but no further. These are probably lost clients.

How are goods distributed among orders?

In [41]:
display(df.pivot_table(index='order_id', values='quantity', aggfunc='sum')
                      .reset_index()
                      .sort_values('quantity', ascending=False)
                      .head(10)
                      .plot(x='order_id', y='quantity', kind='bar')
                     ) 


plt.xlabel('order number')
plt.ylabel('quantity in pcs')
plt.title('Top-10 orders by quantity of goods sold')
plt.show()
<Axes: xlabel='order_id'>

Since we excluded wholesale orders, the fluctuation in the number of goods in the remaining orders is not significant.

Let's move on to the analysis of popular products

In [42]:
# top-10 products
display(df.pivot_table(index='prod', values='quantity', aggfunc='sum')
                      .reset_index()
                      .sort_values('quantity', ascending=False)
                      .head(10)
                      .plot(x='prod', y='quantity', kind='bar')
                     ) 


plt.xlabel('product name')
plt.ylabel('quantity in pcs')
plt.title('Top-10 products')
plt.show()
<Axes: xlabel='prod'>

From the graph above we can conclude that the store sells mostly plants and flowers, including artificial ones.

Let's see which product category is in greatest demand.

To do this, we assigned each of the products one of the following six categories: clothing and shoes, plants, home and everyday life, kitchen utensils, storage and transportation etc.

Categories analysis¶

In [43]:
# revenue and sales volumes by category
plotpie = (df.pivot_table(index='cat', values=['revenue','quantity'], aggfunc='sum')
                      .reset_index()
                      .sort_values('revenue', ascending=False) 
                     ) 



fig, (ax1,ax2) = plt.subplots(1,2,figsize=(12,12)) 


labels = plotpie['cat']
values = plotpie['revenue']
ax1.pie(values,labels = labels,autopct = '%1.1f%%') #plot first pie
ax1.set_title('Revenue by category')


labels = plotpie['cat']
values = plotpie['quantity']
ax2.pie(values,labels = labels,autopct = '%1.1f%%') #plot second pie
ax2.set_title('Volumes of goods sold by category')


plt.show();

Revenue and quantity of goods are distributed unevenly.

Firstly, 44% of sales come from plants, but they only account for 16% of revenue. It is likely that plants are not purchased all year round, so it is important to fill storages with them just before peak demand.

Secondly, the smallest category, “storage and transportation,” accounts for more than 21% of revenue. It is followed by clothing and shoes and household goods, each of which is responsible for a fifth of revenue.

Finally, kitchen and other products account for only 18% of revenue, marking 25% of sales. Perhaps the store should refrain from purchasing goods of such categories in favor of more profitable ones.

Let's see how sales are distributed over time: first by month, and then by day of week and hour

In [44]:
catsum = df.groupby(['ymonth','cat'])[['quantity']].sum()

catsum.columns = ['total_quantity']


fig = px.bar(catsum.reset_index().sort_values(by=['total_quantity'], ascending=False), 
             x='ymonth',
             y='total_quantity', 
             color='cat',
            )

fig.update_layout(title='Product sales by category',
                   xaxis_title='quantity in pcs',
                   yaxis_title='month',
                   yaxis={'categoryorder':'total ascending'})
fig.show() 

Seeking seasonality in sales

We confirmed our hypothesis that plant sales are determined by seasonality. Peak sales occur in spring, or April-May, when the planting season begins. The same fact responds to our earlier observation that sales peaks are observed in April-May. This information will be useful for the store to effectively manage warehouse space.

Sales of clothing and shoes, on the contrary, logically grow in cold autumn and winter months, tnen decrease towards summer. Same applies to kitchen and other products.

There is no pronounced seasonality in sales of household goods and containers, but they decline towards the end of the year. It is difficult to make an unambiguous conclusion about the long-term trend based on data for October 2018 and October 2019, but for all product categories, except for plants and containers, sales fell by 1.5-4 times by October 2019. This may indicate the store's low competitiveness in all categories except plants.

Now let’s build hitmaps to analyze sales by day of week and hour

In [45]:
heatmap_catd = (df.pivot_table(index=['cat','dow','hour'], values='quantity', aggfunc='sum')
              .reset_index()
              .sort_values('dow', ascending=True)
                  )
heatmap_catd
Out[45]:
cat dow hour quantity
0 дом и быт 1 0 4
249 одежда и обувь 1 15 6
250 одежда и обувь 1 16 3
251 одежда и обувь 1 17 9
252 одежда и обувь 1 18 18
... ... ... ... ...
603 растения 7 6 3
604 растения 7 8 46
469 прочее 7 12 8
231 кухонные принадлежности 7 16 8
740 хранение и перевозка 7 23 14

741 rows × 4 columns

In [46]:
plants_hm = (heatmap_catd.query('cat == "растения"')
                         .pivot(index='hour', columns='dow', values='quantity')
            )

fig, ax = plt.subplots(figsize=(5,5))
plt.title('Sales of plants by weekday and hour')
display(sns.heatmap(plants_hm, annot=True, fmt='.0f'))
<Axes: title={'center': 'Sales of plants by weekday and hour'}, xlabel='dow', ylabel='hour'>
In [47]:
clothes_hm = (heatmap_catd.query('cat == "одежда и обувь"')
                         .pivot(index='hour', columns='dow', values='quantity')
            )
 
fig, ax = plt.subplots(figsize=(5,5))
plt.title('Sales of clothes and shoes by weekday and hour')
display(sns.heatmap(clothes_hm, annot=True))
<Axes: title={'center': 'Sales of clothes and shoes by weekday and hour'}, xlabel='dow', ylabel='hour'>
In [48]:
appliances_hm = (heatmap_catd.query('cat == "дом и быт"')
                         .pivot(index='hour', columns='dow', values='quantity')
            )

fig, ax = plt.subplots(figsize=(5,5))
plt.title('Household items sales by weekday and hour')
display(sns.heatmap(appliances_hm, annot=True, fmt='.0f'))
<Axes: title={'center': 'Household items sales by weekday and hour'}, xlabel='dow', ylabel='hour'>
In [49]:
storage_hm = (heatmap_catd.query('cat == "хранение и перевозка"')
                         .pivot(index='hour', columns='dow', values='quantity')
            )

fig, ax = plt.subplots(figsize=(5,5))
plt.title('Storage items sales by weekday and hour')
display(sns.heatmap(storage_hm, annot=True))
<Axes: title={'center': 'Storage items sales by weekday and hour'}, xlabel='dow', ylabel='hour'>

In all cases, there are virtually no sales at night until 9 am. The peak occurs between 9 and 15 hours, and plants are actively purchased even after 15.

In addition to sales of containers, which are distributed more or less evenly throughout the day, most sales occur on Monday-Wednesday.

Since most orders are made during daytime, discounts could be introduced for orders placed late in the evening to smooth out the load on operators.

Let's see if we can separate categories of goods that are bought together with other goods from those bought separately.

Main and additional product assortments

In [59]:
# order ID and unique goods
orders_ops = (df.pivot_table(index='order_id', values='prod', aggfunc='nunique')
                      .reset_index()
                      .sort_values('prod', ascending=False)
                     ) 


# int
orders_ops['prod'] = orders_ops['prod'].astype('int')

# we leave only orders in which one unique product is purchased
orders_ops = orders_ops['order_id'].loc[orders_ops['prod'] == 1]

# series to list
orders_ops = orders_ops.tolist()
In [51]:
# add a column to the dataset that returns 1 if the order contains one unique product
df['sole_order'] = [x in orders_ops for x in df['order_id']]

df['sole_order']
Out[51]:
0       False
1       False
2       False
3       False
4       False
        ...  
4778     True
4779     True
4780     True
4781     True
4782     True
Name: sole_order, Length: 4783, dtype: bool
In [52]:
# sales volume by category
display(df.query('sole_order == True')
              .pivot_table(index='cat', values='quantity', aggfunc='sum')
              .reset_index()
              .sort_values('quantity', ascending=False)
              .plot(x='cat', y='quantity', kind='bar')); 

plt.xlabel('category')
plt.ylabel('quantity in pcs')
plt.title('If an order contains one product, what category does it belong to?')
plt.show()
<Axes: xlabel='cat'>
In [53]:
# sales volume by category
display(df.query('sole_order == False')
              .pivot_table(index='cat', values='quantity', aggfunc='sum')
              .reset_index()
              .sort_values('quantity', ascending=False)
              .head(15)
              .plot(x='cat', y='quantity', kind='bar')); 

plt.xlabel('category')
plt.ylabel('quantity in pcs')
plt.title('If an order contains several products, what category do they belong to?')
plt.show()
<Axes: xlabel='cat'>

So, we have the following ratio of the main and additional assortment for each category:

 1. Plants: 52% main and 48% additional
 2. Household items: 84% and 16%
 3. Other: 82% and 18%
 4. Kitchen apps: 82% and 18%
 5. Clothing and shoes: 85% and 15%
 6. Storage and transportation: 91% and 9%

Let's see what products are included in the main and additional range.

In [54]:
# top-3 products of each category
for i in df['cat'].unique():
    cats = (df.query('cat == @i')
        .groupby(['sole_order','cat','prod'])[['quantity']].sum()
        .sort_values(by=['cat','quantity'], ascending=False)
        .reset_index()
           )
    for k in df['sole_order'].unique(): 
        display(cats.query('sole_order == @k').head(3)) 
sole_order cat prod quantity
3 False растения Пеларгония зональная 239
5 False растения Пеларгония розебудная 156
6 False растения Рассада зелени 149
sole_order cat prod quantity
0 True растения Искусственный цветок 323
1 True растения Цветок искусственный 298
2 True растения Пеларгония зональная 249
sole_order cat prod quantity
10 False одежда и обувь Вешалка деревянная 20
11 False одежда и обувь Вешалка для 19
12 False одежда и обувь Плечики пластмассовые 19
sole_order cat prod quantity
0 True одежда и обувь Сушилка для 278
1 True одежда и обувь Гладильная доска 125
2 True одежда и обувь Вешалка для 81
sole_order cat prod quantity
5 False дом и быт Щетка для 52
7 False дом и быт Щетка-утюжок с 50
15 False дом и быт Набор вешалок 28
sole_order cat prod quantity
0 True дом и быт Ёрш унитазный 103
1 True дом и быт Коврик придверный 98
2 True дом и быт Щетка-сметка 4-х 90
sole_order cat prod quantity
10 False хранение и перевозка Банка стеклянная 7
15 False хранение и перевозка Короб стеллажный 4
17 False хранение и перевозка Ящик для 3
sole_order cat prod quantity
0 True хранение и перевозка Сумка-тележка хозяйственная 111
1 True хранение и перевозка Сумка-тележка 2-х 92
2 True хранение и перевозка Тележка багажная 65
sole_order cat prod quantity
5 False кухонные принадлежности Нож кухонный 33
6 False кухонные принадлежности Кружка НОРДИК 30
12 False кухонные принадлежности Тарелка суповая 13
sole_order cat prod quantity
0 True кухонные принадлежности Таз пластмассовый 121
1 True кухонные принадлежности Тарелка обеденная 119
2 True кухонные принадлежности Тарелка десертная 80
sole_order cat prod quantity
8 False прочее Муляж Апельсин 29
10 False прочее Муляж Яблоко 25
14 False прочее Муляж Красное 20
sole_order cat prod quantity
0 True прочее Муляж Яблоко 124
1 True прочее Муляж Банан 105
2 True прочее Муляж Лимон 91

The differences in popular products of different assortments are as follows:

  1. Plants:
  • main: artificial flowers and pelargonium
  • additional: pelargonium and seedlings
  1. Home and life:
  • brushes, mats, brushes
  • toilet brushes and hangers
  1. Other: both include artificial fruits but of different kinds
  2. Kitchen:
  • basins and plates
  • knives, mugs, plates
  1. Clothes and shoes:
  • dryers, ironing boards and hangers
  • hangers and hangers with shoulder shapers
  1. Storage and transportation:
  • bags and trolleys
  • cans, boxes, crates

Thus, the assortments often overlap, but in terms of quantity we can distinguish the following leaders in the main assortment:

  1. Artificial flowers
  2. Dryers and ironing boards
  3. Bags and trolleys
  4. Basins
In [55]:
# categories frequency
display(df.query('sole_order == False')
              .pivot_table(index='order_id', values='cat', aggfunc='nunique')
              .reset_index()
              .sort_values('cat', ascending=False)
              .hist('cat',bins=6)
       ); 

plt.xlabel('category')
plt.ylabel('orders quantity')
plt.title('200 orders contain products of a single category')
plt.show()
array([[<Axes: title={'center': 'cat'}>]], dtype=object)
In [56]:
# frequency diagram of product categories in orders with several unique products
display(df.query('sole_order == False')
              .pivot_table(index='cat', values='order_id', aggfunc='nunique')
              .reset_index()
              .sort_values('order_id', ascending=False)
              .plot(x='cat', y='order_id',kind='bar')
       ); 

plt.xlabel('category')
plt.ylabel('goods in pcs')
plt.title('Of the 277 such orders, more than 200 contain plants')
plt.show()

# сколько всего таких заказов?
display(df.query('sole_order == False')['order_id'].nunique())
<Axes: xlabel='cat'>
277

We can see that goods of all categories, excluding plants, in 5 cases out of 6 (or more often) are purchased separately. Plants are purchased separately in 40% of cases.

However, there may be situations where an order contains several unique products, but they all belong to the same category. We found out that there are more than two hundred such orders. Moreover, out of 277 orders that contain 2 or more unique products, 200 contain plants.

Thus, although 60% of plants are purchased together with other goods, these other goods are mostly also plants.

Hypotheses about the equality of revenue and quantity between the main and additional assortments¶

Criterion for conclusion about the significance/insignificance of differences: Non-parametric Wilcoxon-Mann-Whitney test

Level of significance: 0.05

Null hypothesis: According to cleaned data, there are no statistically significant differences in the average revenue between the products of the main and additional ranges

Alternative hypothesis: The differences in average revenue between the main and additional assortment products according to the cleaned data are statistically significant

In [57]:
print('p-value:', "{0:.3f}".format(stats.mannwhitneyu(df[df['sole_order']== True]['revenue'], 
                                          df[df['sole_order']== False]['revenue'])[1]))
print('relative difference in average revenue','{0:.3f}'.format(df[df['sole_order']== True]['revenue'].mean()/df[df['sole_order']== False]['revenue'].mean()-1)) 
p-value: 0.000
relative difference in average revenue 3.764

Criterion for conclusion about the significance/insignificance of differences: Non-parametric Wilcoxon-Mann-Whitney test

Level of significance: 0.05

Null hypothesis: According to cleaned data, there are no statistically significant differences in the average quantity between the products of the main and additional assortments

Alternative hypothesis: The differences in the average quantity between the products of the main and additional assortments according to the cleaned data are statistically significant

In [58]:
print('p-value:', "{0:.3f}".format(stats.mannwhitneyu(df[df['sole_order']== True]['quantity'], 
                                          df[df['sole_order']== False]['quantity'])[1]))
print('relative difference in average quantity','{0:.3f}'.format(df[df['sole_order']== True]['quantity'].mean()/df[df['sole_order']== False]['quantity'].mean()-1)) 
p-value: 0.000
relative difference in average quantity 0.597

In both cases, p-value is less than the significance level, so we accept the hypothesis that there are no differences in both average revenue and the average number of goods sold in the main and additional ranges.

5. Overall conclusions and recommendations¶

Having analysed the dataset, we came to the following conclusions:

 1. Sales data covers 1 year and 1 month: from October 2018 to October 2019.
 2. We found gaps in the numbering of orders - a question to technical support about its adequacy ensues.
 3. This is a retail store selling goods mainly under 1,000 CU.
 4. No full duplicates were found, but there are both orders duplicated at different times and erroneous buyer-order pairing. They took up about 30% of the dataset.
 5. Wholesale sales accounted for 4.5% of operations and a third (1.3 mCU) of revenue, including a single purchase of 1,000 units of goods (probably an erroneous entry).
 6. Retail sales amounted to 2.8 mCU. Monthly trend indicates a gradual reduction in revenue both in monetary and physical terms. The largest purchases in terms of volume are observed in April and May, during peak plant sales.
 7. Sales structure is heterogeneous:

     First, more than half of sales come from plants, but they only account for 17% of revenue. It is likely that seeds for planting are not purchased all year round, so it is important to fill storage areas with them just before peak demand.

     Secondly, the smallest category, “storage and transportation,” brings in the most (a quarter) of revenue. It is followed by clothing and shoes, and household goods, each of which is responsible for a fifth of revenue.

     Finally, kitchen and other products account for only 15% of revenue and for 17% of quantity. Perhaps the store should refrain to purchase goods of categories in favor of more profitable ones.
 8. Seasonality is observed in sales of plants (April-May) and clothing and shoes (autumn-winter). Distribution by day of the week is almost uniform, with the bulk of sales occurring in the first half of the day, or from 9 a.m. to 3 p.m. (up to 6 p.m. for plants).
 9. 85% of products in all categories, excluding plants, are sold separately. Plants are sold both separately (40%) and together with other products. However, most of these other products are other plants.

Recommendations:

 1. Prepare unified instructions for order operators on how to enter data into the database so that there are no omissions or erroneous orders.
 2. It may be worth coming up with profitable offers for wholesale buyers: there are few of them, but they account for third of the revenue.
 3. Make adjustments to the procurement and warehouse management system. Plants take up the most space, but are sold almost only in spring. Accordingly, in autumn-winter it would be possible to free up more space for other categories of goods: clothes and shoes needed in cold weather, and containers that bring in more revenue.
 4. The purchase of kitchen and other goods should be reduced or abandoned in favor of more profitable ones.
 5. Consider a system of discounts when ordering goods in the evening to relieve operators.